Q-adic Transform Revisited 



Jean-Guillaume Dumas 
Universite de Grenoble, Laboratoire J. Kuntzmann, umr CNRS 5224- 
BP 53X, 51, rue des Mathematiques. F38041 Grenoble, France. 



Jean-Guillaume . Dumas@imag . f r 



June 23, 2008 



Abstract 

We present an algorithm to perform a simultaneous modular reduction 
of several residues. This enables to compress polynomials into integers and 
perform several modular operations with machine integer arithmetic. The 
idea is to convert the X-adic representation of modular polynomials, with 
X an indeterminate, to a g-adic representation where q is an integer larger 
than the field characteristic. With some control on the different involved 
sizes it is then possible to perform some of the g-adic arithmetic directly 
with machine integers or floating points. Depending also on the number 
of performed numerical operations one can then convert back to the q- 
adic or X-adic representation and eventually mod out high residues, fn 
this note we present a new version of both conversions: more tabulations 
and a way to reduce the number of divisions involved in the process are 
presented. The polynomial multiplication is then applied to arithmetic 
and linear algebra in small finite field extensions. 
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1 Introduction 

The FFLAS/FFPACK project has demonstrated the usefulness of wrapping 
cache-aware routines for efficient small finite field linear algebra [H [5] . 

A conversion between a modular representation of prime fields and e.g. float- 
ing points used exactly is natural. It uses the homomorphism to the integers. 
Now for extension fields (isomorphic to polynomials over a prime field) such a 
conversion is not direct. In [4] we proposed transforming the polynomials into 
a q-adic representation where q is an integer larger than the field characteristic. 
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We call this transformation DQT for Discrete Q-adic Transform, it is a form of 
Kronecker substitution [3 §8.4]. With some care, in particular on the size of q, 
it is possible to map the operations in the extension field into the floating point 
arithmetic realization of this g-adic representation and convert back using an 
inverse DQT. 

In this note we propose some implementation improvements: we propose to 
use a tabulated discrete logarithm for the DQT and we give a trick to reduce 
the number of machine divisions involved in the inverse. This then gives rise to 
an improved DQT which we thus call FQT (Fast Q-adic Transform) . This FQT 
uses a simultaneous reduction of several residues, called REDQ, and some table 
lookup. 

Therefore we recall in section [2] the previous conversion algorithm and dis- 
cuss in section [3] about a floating point implementation of modular reduction. 
This implementation will be used throughout the paper to get fast reductions. 
We then present our new simultaneous reduction in section |4] and show in sec- 
tion [5] how a time- memory trade-off can make this reduction very fast. This 
fast reduction is then applied to modular polynomial multiplication with small 
prime fields in section [6] It is also applied to small extension field arithmetic 
and fast matrix multiplication over those fields in section [7J 

2 Q-adic representation of 
polynomials 

We follow here the presentation of [I] of the idea of [T2] : polynomial arithmetic 
is performed a q— adic way, with q a sufficiently big prime or power of a single 
prime. 

Suppose that a — Yli=o a iX l and b = Y^,i^o PiX 1 are two polynomials 
in Z/pZ[-X"]- One can perform the polynomial multiplication ab via q— adic 

numbers. Indeed, by setting a — Y^%=o ai( ? an< ^ ^ = ^2i=o Pi<l' \ the product is 
computed in the following manner (we suppose that on — Pi — for i > k — 1): 



Now if q is large enough, the coefficient of q % will not exceed q — 1. In this case, 
it is possible to evaluate a and b as machine numbers (e.g. floating point or 
machine integers), compute the product of these evaluations, and convert back 
to polynomials by radix computations (see e.g. Algorithm 9.14]). There just 
remains then to perform modulo p reductions on every coefficient as shown on 
example [T] 

Example 1. For instance, to multiply a = X + 1 by b = X + 2 in Z/gz[X] 
one can use the substitution X = 100: compute 101 x 102 = 10302, use radix 
conversion to write 10302 = q 2 +3q+2 and reduce modulo 3 to get axb = X 2 +2. 




(1) 



We call DQT the evaluation of polynomials modulo p at q and DQT inverse 
the radix conversion of a g-adic development followed by a modular reduction, 
as shown in algorithm [T] 



Algorithm 1 Polynomial multiplication by DQT 

Input Two polynomials v\ and V2 in Z/pZ[A"] of degree less than k. 

Input a sufficiently large integer q. 

Output R £ Z/pZ[X], with R = vi.v 2 . 

Polynomial to q— adic conversion 
1: Set v\ and €2 to the floating point vectors of the evaluations at q of the 
elements of v\ and V2- {Using e.g. Horner's formula} 

One computation 
2: Compute f = V1V2 
Building the solution 

3: f = J^'ito' 2 fiiQ l - {Using radix conversion, see e.g. pj Algorithm 9.14]} 
4: For each i, set /i; = /Tj mod p 
5: set R = Y%-o 2 ^i^ 4 



Depending on the size of q, the results can still remain exact and we obtain 
the following bounds generalizing that of §8.4]: 

Theorem 1. J4jj Let m be the number of available mantissa bits within the 
machine numbers and n q be the number of polynomial products V1.V2 of degree 
k accumulated before the re-conversion. If 

q > n q k(p - l) 2 and (2k - 1) log 2 (g) < m, (2) 

then Algorithm^ is correct. 

Note that the integer q can be chosen to be a power of 2. Then the Horner 
like evaluation (line 1 of algorithm[T|) of the polynomials at q is just a left shift. 
One can then compute this shift with exponent manipulations in floating point 
arithmetic and use native shift operator (e.g. the << operator in C) as soon as 
values are within the 32 (or 64 when available) bit range. 

It is shown on [4, Figures 5 & 6] that this wrapping is already a pretty good 
way to obtain high speed linear algebra over some small extension fields. Indeed 
we were able to reach high peak performance, quite close to those obtained with 
prime fields, namely 420 Millions of finite operations per second (Mop/s) on a 
Pentium III, 735 MHz, and more than 500 Mop/s on a 64-bit DEC alpha 500 
MHz. This was roughly 20 percent below the pure floating point performance 
and 15 percent below the prime field implementation. 



3 Euclidean division by floating point routines 



In the implementations of the proposed subsequent algorithms, we will make 
extensive use of Euclidean division in exact arithmetic. Unfortunately exact 
division is usually quite slow on modern computers. This division can thus be 
performed by floating point operations. Suppose we want to compute r/p where 
r and p and integers. Then their difference is representable by a floating point 
and, therefore, if r/p is computed by a floating point division with a rounding to 
nearest mode, [9l Theorem I] assures that flooring the result gives the expected 
value. Now if a multiplication by a precomputed inverse of p is used (as is done 
e.g. in NTL [13]), proving the correctness for all r is more difficult, see [TU] for 
more details. We therefore propose the following simple lemma which enables 
the use of the rounding upward mode to the cost of loosing only one bit of 
precision: 

Lemma 1. For two positive integers p and r and e > 0, we have 

as lonq as r < jr. 

2e + e 2 

Proof. Consider up < r < up + i with u, i positive integers and i < p. Then 
= u and £(1 + e)(l + e) = u + i + 7 -(2e + e 2 ). The latter is maximal at 
i = p — 1. This proves that flooring is correct as long as ^(2e + e 2 ) < i. □ 

This proves that when rounding towards +oo it is possible to perform the 
division by a multiplication by the precomputed inverse of the prime number 
as long as r is not too large. Since our entries will be integers but stored in 
floating point format this is a potential significant speed-up. 



r 
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, I -(1 + e) J J (1 + e) 



4 REDQ: modular reduction in the DQT do- 
main 

The first improvement we propose to the DQT is to replace the costly modular 
reduction of the polynomial coefficients by a single division by p (or, better, by 
a multiplication by its inverse) followed by several shifts. In order to prove the 
correctness of this algorithm, we first need the following lemma: 

Lemma 2. For r E N and a, b £ N* , 



LIJ 




-1 = 




a 




-ab\ 


b 



Proof. We proceed by splitting the possible values of r into intervals kab < r < 
(k + l)ab, where k = [^jj . Then kb < - < (k + l)b and since kb is an integer 



we also have that kb < |_^J < (k + 1)6. Thus k < -LfJ- < k + 1 and 



Obviously the same is true for the left hand side which proves the lemma. 



k. 
□ 



This idea is used in algorithm [5] to perform several remainderings with a 
single machine division (note that when q is a power of 2, and when elements 
are represented using an integral type, division by q l and flooring are a single 
operation, a right shift). 



Algorithm 2 REDQ 



Input Two integers p and q satisfying the conditions ([2]) . 
Input f = J]i=o fii<f e Z - 

Output p € Z, with p = Y2i=o Mi? 1 where pi = /!,; mod p. 



to d dc 






rop 





rop = 
for i 

it; 

end for 

= u d 

for i = to d — 1 do 

Pi — Ui - mod p; 

end for 

Return p = £\ =0 ; 



Theorem 2. Algorithm REDQ is correct. 



Proof. First we need to prove that < Ui < p. By definition of the truncation, 



we have 



1 < 4|<4and-^-l-i< I < Thus -1 < 

Mi < p + ^2-, which is < u, < p since u; is an integer. We now consider the 
possible case Ui = p and show that it does not happen, m — p means that 

) = pg. This means that pgq 1 < r < pgq 1 + q l . So that in 

g. But 



2-. Thus «<^<o+iso that 

p ^ — q 1 - 3 p 



turns gq l < rop < - < gq 

then from the definition of g we have that g = g — 1 which is absurd. Therefore 
< ^ <p - 1. 

Second we show that = 53j=j ( J, j^~ i m od p. Line[5]of algorithm[5]defines 
i|J' 

9 l 



*J-' 

The latter is m = 

I r\=y d 



and thus lemma [21 gives that u; = 









f 
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mod p. Now, since f = Y2j=a Mj^ i we have that 
Therefore, mod p, the equality is proven. □ 



Note that the last steps are not needed when p divides q. Indeed in this case 
q = mod p. The trick works then simply as shown on example [2] below: 

Example 2. Let a = X 2 + 2X + 3 and b = AX 2 + 5X + 6 unreduced modulo 5. 
Then axb = 40013002800270018, with q = 10000, for which we need to reduce 
five coefficients modulo 5. The trick is that we can recover all the residues at 
once. Line produces rop = [08002600560054003. 6J . It thus contains all the 



quotients 0;0002;0005;0005;0003 and one has then just to multiply by p and sub- 
tract to get: a>Cb = 40013002800270018 - 2000500050003 x 5 = 40003000300020003 
so that axb = 4X 4 + 3A 3 + 3X 2 + 2X + 3. 

Now we can give a full example to show the last corrections required when p 
does not divide q. The first part of the algorithm, lines to [2] is unchanged and 
is used to get small sizes for The second part is then just a small correction 
modulo p to get the correct result. 

Example 3. Take the polynomial R = 1234X 3 + 5678A 2 + 9123A + 4567, 
the prime p — 23 and use q — 10 6 . In this case, the division gives rop — 
[1234005678009123004567/23J = 53652420783005348024. Then the multiplica- 
tion by the prime produces rop x 23 = 1234005678009123004552 so that u = 
4567-4552 =15. We shift to get 1234005678009123 and 53652420783005 x 23 = 
1234005678009115 which gives u\ = 9123 - 9115 = 8. We shift and multiply 
twice to get it 2 = 18 and u 3 = /i 3 = 15 just like in example^ Now —q = — 10 6 
mod 23 = 17 which is non zero and thus we have to compute the corrections 
of lines \^to\^of algorithm [H This can also be formalized as a matrix vector 
product: 

' 1 17 
1 17 
l l = 1 17 U m P 
1 

to get the final result, R = 15X 3 + 20X 2 + 15X + 13 mod 23. 

The algorithm is efficient because one can precompute 1/p, l/q, l/q 2 etc. 
and use multiplication to compute all of the mods. The computation of each Uj 
and fj,i can also be pipelined or vectorized since they are independent. As is, 
the benefit when compared to direct remaindering by p is that the corrections 
occur on smaller integers. Thus the remaindering by p can be faster. Actually, 
another major acceleration can be added: the fact that the [ii are much smaller 
than the initial fii makes it possible to tabulate the corrections as shown next. 



5 Time-Memory trade-off in REDQ 

5.1 A Matrix version of the correction 

Indeed, there is a bijection between the Ui and the This can be viewed 
on the corrections of lines to of algorithm [21 view these corrections as a 
matrix- vector multiplication by a matrix Q c i as in example [3] Then we have 
that: 

"1 -q ... 
'■■ '■• '•• : 

Qd = ■ '•. '-. Q 

-9 

1 



1 q q> ... q* 

'•• '•• '■• : 

; ■•. ••. ■•. 

•- q 

1 



5.2 Tabulations of the matrix- vector product and Time- 
Memory trade-off 

Thus if the multiplication by Qd is fully tabulated, it requires a table of size at 
least p d+1 . But, due to the nature of Qd, we have the relations of figure [TJ 
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Figure 1: Recurring relations on the Qd matrices. 
Therefore, it is very easy to tabulate with a table of size p k only and perform 
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d+l-k 
fc-1 



d 
fc-1 



table accesses as shown on example |4j 



Example 4. Let us compute the corrections for a degree 6 polynomial. One can 
tabulate the multiplication by Qq, a 7 x 7 matrix, with therefore p 7 entries each 
of size at least 71og 2 (p). Or one can tabulate the multiplication by Q2, a 3 x 3 
matrix. To compute [fig, ... , fj,e] T — Qe[uo ■ ■■ , u§] T one can instead use three 
multiplications by Q 2 and discard the last entry for the first two multiplications 
as shown on the following algorithm: 



Algorithm 3 Q§ with an extra memory of size p 3 
Input [lio ... , M6] G Z/pZ 7 - 

Input The table Q2 of the associated 3x3 matrix-vector multiplication over 
Z/pZ- 

Output [u , . . . , ^i 6 ] T = Q 6 [u . . . , u 6 ] T . 
1: no, oi, a 2 — Q2N), Mi, u 2 \, 
2: b ,bi,b 2 = Q2[m2,m 3 ,m 4 ]; 

3: C ,Ci,C 2 = Q2[U4,U 5 ,U 6 }; 

4: Return [u , • . . ,u 6 ] T = [o , ai, 6 , 61, c , ci, c 2 ] T ; 



When 5 is a power of 2, the computation of the Mj, in the first part of 
algorithm [2] requires 1 div & mul & 2d shifts. Now, the time memory 

trade-off enables to compute the second part at a choice of costs given on table[TJ 



5.3 Indexing 

In practice, indexing by a t-uple of integers mod p is made by evaluating at p, as 
Y^Uip 1 . If more memory is available, one can also directly index in the binary 
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Table 1: Time-Memory trade-off in REDQ of degree d over Z/pZ 

format using (2r io S2(p)l) 1 . On the one hand all the multiplications by p are 
replaced by binary shifts. On the other hand, this makes the table grow a little 
bit, fromp fc to 2r io S2(p)l fc . 

6 Comparison with delayed 
reduction for polynomial 
multiplication 

The classical alternative to algorithm [1] to perform modular polynomial multi- 
plication is to use delayed reductions e.g. as in pQ: the idea is to accumulate 
products of the form '^2 i aibk-i, without reductions, while the sum does not 
overflow. Thus, if we use for instance a centered representation modulo p (inte- 
gers from —^2 to ^^-), it is possible to accumulate at least rid products as long 
as 

Mp-1) 2 <2 m+1 (3) 

The modular reduction can be made by many different ways (e.g. classical 
division, floating point multiplication by the inverse, Montgomery reduction, 
etc.), we just call the best one REDC here. It is at most equivalent to 1 machine 
division. 

Now the idea of the FQT (Fast Q-adic Transform) is to represent modular 

polynomials of the form P — Yli=o a i^ 1 by P — Pi (X d+i y where the Pi are 
degree d polynomials stored in a single integer in the q-adic way. 
Therefore, a product PQ has the form 

£(£iWt-i) (jt*- 1 )*. 

There, each multiplication PiQt-i is made by algorithm [T] on a single machine 
integer. The reduction is made by a tabulated REDQ and can also be delayed 
now as long as conditions |2]) are guaranteed. 

We want to compare these two strategies. We thus propose the following 
complexity model: we count only multiplications and additions in the field 
as an atomic operation and separate the machine divisions. We for instance 
approximate REDC by a machine division. We call REDQj, a simultaneous 
reduction of k residues. In our complexity model a REDQ& thus requires 1 



division and 2k multiplications and additions. We also call d— FQT the use of a 
degree d g-adic substitution. Thus a multiplication PiQt-i in a d— FQT requires 
the reduction of 2c? + 1 coefficients, i.e. a REDQ2d+i- 

Let P be a polynomial of degree N with indeterminate " X" . If we use a 
d— FQT, it will then become a polynomial of degree D q in the indeterminate 
Y = X d+1 . Thus, 

~N + 1" 



d+l 



1. 



Table [2] gives the respective complexities of both strategies. 
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Table 2: Modular polynomial multiplication complexities. 



For instance, with p = 3, N — 500, if we choose a double floating point 
representation and a degree 4 DQT, the fully tabulated FQT boils down to 10 5 
multiplications and additions and 4.10 3 divisions. For the same parameters, the 
classical polynomial multiplication algorithm requires 10 6 multiplications and 
additions and only 10 3 remaindering. This is roughly 9 times more operations 
as shown on figure O 

Even by switching to a larger mantissa, say e.g. 128 bits, so that the DQT 
multiplications are roughly 4 times costlier than double floating point opera- 
tions, this can still be useful: take p — 1009 and choose d = 3, still gives around 
10 5 multiplications and additions over 128 bits and 4.10 3 divisions. This makes 
8 times less operations. This should therefore still be faster than the delayed 
over 32 bits. 

On figure [2] we compare also our two implementations with that of NTL 
[T3"] . We see that the FQT is faster than NTL as long as better algorithms 
are not used. Indeed the change of slope in NTL's curve reflects the use of 
Karatsuba's algorithm for polynomial multiplication. One should note that NTL 
also proposes a very optimized modulo 2 implementation which is an order of 
magnitude faster than our implementation on small primes. There is therefore 
room for more improvements on small fields. Our strategy is anyway very useful 
for small degrees and small primes. Furthermore, we have not implemented the 
FQT as the base case of faster recursive algorithms such as Karatsuba, Toom- 
Cook, etc. The figure shows that these recursive algorithms together with the 
FQT could be the fastest. 

In particular, the FQT already improves the speed of small finite field ex- 
tension's arithmetic as shown next. 



polymul/s Polynomial mutiplication modulo 3 on a Xeon 3.6GHz 
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Figure 2: Polynomial multiplications modulo 3 per second on a Xeon 3.6 GHz 

7 Application to small finite 
field extensions 

The isomorphism between finite fields of equal sizes gives us a canonical repre- 
sentation: any finite field extension is viewed as the set of polynomials modulo 
a prime p and modulo an irreducible polynomial V of degree k. Clearly we 
can thus convert any finite field element to its g-adic expansion ; perform the 
FQT between two elements and then reduce the obtained polynomial modulo 
V . Furthermore, it is possible to use floating point routines to perform exact 
linear algebra as demonstrated in [6]. 

Our strategy here, see algorithm^ is thus to convert vectors over GF(p fe ) to 
g-adic floating point, to call a fast numerical linear algebra routine (BLAS) and 
then to convert the floating point result back to the usual field representation. 
In this paper we propose to improve all the conversion steps of [H algorithm 4.1] 
and thus approach the performance of the prime field wrapping also for small 
extension fields: 

1. Replace the Horner evaluation of the polynomials, to form the g-adic ex- 
pansion, by a single table lookup, recovering directly the floating point 




representation. 



2. Replace the radix conversion and the costly modular reductions of each 
polynomial coefficient, by a single REDQ operation. 

3. Replace the polynomial division by two table lookups and a single field 
operation. 

Indeed, suppose the internal representation of the extension field is already 
by discrete logarithms and uses conversion tables from polynomial to index 
representations. See e.g. [T] for more details. Then we choose a time-memory 
trade-off for the REDQ operation of the same order of magnitude, that is to 
say p k . The overall memory required by these new tables only doubles and 
the REDQ requires only 2 accesses. Moreover, in the small extension, the 
polynomial multiplication must also be reduced by an irreducible polynomial, 
V. We show next that this reduction can be precomputed in the REDQ table 
lookup and is therefore almost free. 

Moreover, many things can be factorized if the field representation is by 
discrete logarithms. Indeed, the element are represented by their discrete log- 
arithm with respect to a generator of the field, instead of by polynomials. In 
this case there are already some table accesses for many arithmetic operations, 
see e.g. [TJ §2.4] for more details. 

More precisely, we here propose algorithm 0] for linear algebra over extension 
fields: line 1 is the table look-up of floating point values associated to elements 
of the field ; line 2 is the numerical computation ; line 3 to 7 is the first part 
of the REDQ reduction ; line 8 and 9 are a time-memory trade-off with two 
table accesses for the corrections of REDQ, combined with a conversion from 
polynomials to discrete logarithm representation ; the last line 10 combines the 
latter two results, inside the field. 

A variant of REDQ is used in algorithm^ but Ui still satisfies Ui = Y^j=i 2 Mi? 
mod p as shown in theorem [5] Therefore the representations of ^iX J in the 
field can be precomputed and stored in two tables where the indexing will be 
made by (uq, . . . ,it/._i) and (itfc_i, . . . , U2/8-2) an d n °t by the //j's as shown 
next. 

Theorem 3. Algorithm^ is correct. 

Proof. There remains to prove that it is possible to compute L and H from the 
itj. From the equality above, we see that fi2k-2 = U2k-2 and Mi — u i ~ Q u i+i 
mod p, for i = 0..(2k — 3). Therefore a precomputed table of p k entries, indexed 
by (wo, . . • , Ufc_i), can provide the representation of 

fc-2 

L = 2j(uj - qu i+ i mod p)X % . 
Another table with p k entries, indexed by (ttfc-i, . . . ,112k— 2), c & n provide the 



Algorithm 4 Fast Dot product over Galois fields via FQT and FQT inverse 

Input a field GF(p k ) with elements represented as exponents of a generator of 
the field. 

Input Two vectors v\ and i»2 of elements of GF(p k ). 
Input a sufficiently large integer q. 
Output R e GF(p k ) 1 with R = vf.v 2 . 

Tabulated q— adic conversion 

{Use conversion tables from exponent to floating point evaluation} 
1: Set v\ and V2 to the floating point vectors of the evaluations at q of the 
elements of v\ and vi . 

The floating point computation 

2: Compute r = v\ T V2'-, 

Computing a radix decomposition 



3: r = [r 



{r = f but we might need a conversion to an integral type} 



rop - 

for i = to 2k - 2 do 



ttj = 
end for 



P 



rop 

— J' 

Tabulated radix conversion to exponents of the generator 



{/ii is such that = fa mod p for f = X)i=o 2 
8: Set L = representation(^2 k ~Q HiX 1 ). 
9: Set if = representation^- 1 x Y%t~k-i faX i - k+1 ). 

Reduction in the field 
10: Return R = H + L € GF(p k ); 



representation of 

2k-3 

H = u 2 k-2X 2k - 2 + ^2 (ut- qu i+1 mod p)X\ 

i=k—l 

Finally R = X k ~ x x Yn=k-i Vi xi ~ k+1 + Eto ^ needs to be reduced 
modulo the irreducible polynomial used to build the field. But, if we are given 
the representations of H and L in the field, R is then equal to their sum inside 
the field, directly using the internal representations. □ 

Table [3] recalls the respective complexities of conversion phase in the two 
presented algorithms. 

Figure [5] shows only the speed of the conversion after the floating point 
operations. The log scales prove that for q ranging from 2 1 to 2 26 (on a 32 bit 





Alg.ffl 


Aig.a 


Alg.S 


Memory 


3p k 




4p fe + 2 fc r io S2Pl+ 1 


Shift 


4k -2 


4fc -2 


4k- 2 


Add 


4fc-4 





2k- 1 


Axpy 





4fc-3 


2fc- 1 


Div 


2k- 1 








Table 





3 


3 


Red 


> 5k 


4 


4 











Table 3: Complexity of the back and forth conversion between extension field 
and floating point numbers 




Figure 3: Speed of finite field Winograd matrix multiplication on a XEON, 3.6 
GHz 



Xeon) our new implementation is two to three times faster than the previous 
one. 

Furthermore, these improvements e.g. allow the extension field routines to 
reach the speed of 7800 millions of GF(9) operations per second (on a XEON, 
3.6 GHz, using Goto BLAS-1.09 dgemm as the numerical routine [8] and FFLAS 
fgemm for the fast prime field matrix multiplication [5]) as shown on figure [31 
The FFLAS routines are available within the LinBox 1.1.4 library [TT] and the 
FQT is in implemented in the givgf qext .h file of the Givaro 3.2.9 library [3]. 

With these new implementations, the obtained speed-up shown in figure [3] 
represents a reduction from the 15 percent overhead of the previous implemen- 
tation to less than 4 percent now, when compared to GF(ll). 




8 Conclusion 

We have proposed a new algorithm for simultaneous reduction of several residues 
stored in a single machine word. For this algorithm we also give a time-memory 
trade-off implementation enabling very fast running time if enough memory is 
available. 

We have shown very effective applications of this trick for both modular 
polynomial multiplication, and extension fields conversion to floating point. The 
latter allows efficient linear algebra routines over small extension fields but also 
linear algebra over small prime fields as shown in [2] . 

Further improvements include comparison of running times between choices 
for q. Indeed our experiments were made with q a power of two and large table 
lookup. With q a multiple of p the table lookup is not needed but divisions by 
q % will be more expensive. 

It would also be interesting to see how does the trick extend in practice to 
larger precision implementations: on the one hand the basic arithmetic slows 
down, but on the other hand the trick enables a more compact packing of 
elements (e.g. if an odd number of field elements can be stored inside two 
machine words, etc.). 
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